Spoken Language Intent Detection using Confusion2Vec
Decoding a speaker's intent is a crucial part of spoken language understanding
(SLU). The presence of noise or errors in the text transcriptions in real-life
scenarios makes the task more challenging. In this paper, we address spoken
language intent detection under noisy conditions imposed by automatic speech
recognition (ASR) systems. We propose to employ confusion2vec word feature
representation to compensate for the errors made by ASR and to increase the
robustness of the SLU system. Confusion2vec, motivated by human speech
production and perception, models acoustic relationships between words in
addition to the semantic and syntactic relations of words in human language. We
hypothesize that ASR often makes errors relating to acoustically similar words,
and that confusion2vec, with its inherent model of acoustic relationships between
words, can compensate for these errors. Experiments on the ATIS benchmark
dataset demonstrate the robustness of the proposed model, which achieves
state-of-the-art results under noisy ASR conditions. Our system reduces
classification error rate (CER) by 20.84% and improves robustness by 37.48%
(lower CER degradation) relative to the previous state-of-the-art going from
clean to noisy transcripts. Improvements are also demonstrated when training
the intent detection models on noisy transcripts.
Confusion2vec 2.0: Enriching Ambiguous Spoken Language Representations with Subwords
Word vector representations enable machines to encode human language for
spoken language understanding and processing. Confusion2vec, motivated by
human speech production and perception, is a word vector representation which
encodes ambiguities present in human spoken language in addition to semantic
and syntactic information. Confusion2vec provides a robust spoken language
representation by considering inherent human language ambiguities. In this
paper, we propose a novel word vector space estimation by unsupervised learning
on lattices output by an automatic speech recognition (ASR) system. We encode
each word in confusion2vec vector space by its constituent subword character
n-grams. We show that the subword encoding helps better represent the acoustic
perceptual ambiguities in human spoken language via information modeled on
lattice structured ASR output. The usefulness of the proposed Confusion2vec
representation is evaluated using semantic, syntactic and acoustic analogy and
word similarity tasks. We also show the benefits of subword modeling for
acoustic ambiguity representation on the task of spoken language intent
detection. The results significantly outperform existing word vector
representations when evaluated on erroneous ASR outputs. We demonstrate that
Confusion2vec subword modeling eliminates the need for retraining/adapting the
natural language understanding models on ASR transcripts.
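The subword encoding described in this abstract can be illustrated with a minimal sketch. The abstract does not state the n-gram lengths, the boundary markers, or how n-gram vectors are combined, so the `<`/`>` markers, the 3 to 6 length range, and the averaging step below are assumptions borrowed from common fastText-style practice, and the function names are hypothetical.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers.

    The markers and the default length range are assumptions, not values
    taken from the abstract.
    """
    wrapped = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def word_vector(word, ngram_vectors, dim):
    """Average the vectors of those n-grams that exist in the lookup table.

    ngram_vectors is a hypothetical dict mapping n-gram -> list of floats.
    Returns a zero vector when no n-gram of the word is known.
    """
    total = [0.0] * dim
    count = 0
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total] if count else total
```

Under this scheme, similarly spelled words such as "see" and "sea" share the n-gram "<se", so their vectors share a component. This is one plausible reading of how subwords could help with acoustically confusable words, not a claim about the paper's exact mechanism.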
Scaling Laws for Discriminative Speech Recognition Rescoring Models
Recent studies have found that model performance follows a smooth power-law
relationship, known as a scaling law, with training data and model size, for a wide
range of problems. These scaling laws allow one to choose nearly optimal data
and model sizes. We study whether this scaling property is also applicable to
second-pass rescoring, which is an important component of speech recognition
systems. We focus on RescoreBERT as the rescoring model, which uses a
pre-trained Transformer-based architecture fine-tuned with an ASR
discriminative loss. Using such a rescoring model, we show that the word error
rate (WER) follows a scaling law for over two orders of magnitude as training
data and model size increase. In addition, it is found that a pre-trained model
would require less data than a randomly initialized model of the same size,
representing effective data transferred from the pre-training step. This effective
data transferred is found to also follow a scaling law with the data and model
size.
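As a rough illustration of what fitting such a scaling law involves, the sketch below fits a pure power law, error ≈ a · size^(-b), by ordinary least squares in log-log space. The abstract does not give the functional form used in the paper, so the pure power law (with no irreducible-error term) is an assumption, `fit_power_law` is a hypothetical helper, and the data points are synthetic rather than results from the paper.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of errors ~ a * sizes**(-b) in log-log space.

    Returns (a, b). Assumes all sizes and errors are positive and that at
    least two distinct sizes are given.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic data generated from a known power law (not from the paper).
sizes = [1e3, 1e4, 1e5, 1e6]
errors = [2.0 * s ** -0.1 for s in sizes]
a, b = fit_power_law(sizes, errors)
```

On a log-log plot a power law is a straight line, which is why a linear fit on the logarithms recovers the coefficient `a` and the exponent `b`.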
Discriminative Speech Recognition Rescoring with Pre-trained Language Models
Second pass rescoring is a critical component of competitive automatic speech
recognition (ASR) systems. Large language models have demonstrated their
ability in using pre-trained information for better rescoring of ASR
hypotheses. Discriminative training, which directly optimizes the minimum
word-error-rate (MWER) criterion, typically improves rescoring. In this study,
we propose and explore several discriminative fine-tuning schemes for
pre-trained LMs. We propose two architectures based on different pooling
strategies of output embeddings and compare with probability based MWER. We
conduct detailed comparisons between pre-trained causal and bidirectional LMs
in discriminative settings. Experiments on LibriSpeech demonstrate that all
MWER training schemes are beneficial, giving additional gains up to 8.5% WER.
Proposed pooling variants achieve lower latency while retaining most
improvements. Finally, our study concludes that bidirectionality is better
utilized with discriminative training. Comment: ASRU 202
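For readers unfamiliar with the MWER criterion mentioned in this abstract, the following sketch shows one commonly used formulation: the expected number of word errors over an n-best list, with the list's mean error subtracted as a baseline. The abstract does not spell out the exact loss, so this formulation is an assumption; `mwer_loss` is a hypothetical name, and the pooling-based variants proposed in the paper are not represented here.

```python
import math


def mwer_loss(scores, word_errors):
    """Expected relative word errors over an n-best list.

    scores: one model score per hypothesis (higher means more likely).
    word_errors: word-level edit distance of each hypothesis to the reference.
    """
    # Softmax over the n-best list, shifted by the max for numerical stability.
    max_s = max(scores)
    exps = [math.exp(s - max_s) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Subtracting the mean error makes the loss relative to the list average.
    mean_err = sum(word_errors) / len(word_errors)
    return sum(p * (e - mean_err) for p, e in zip(probs, word_errors))
```

The loss is negative when probability mass sits on hypotheses with fewer errors than the list average, and positive in the opposite case, so minimizing it pushes the rescoring model to rank low-error hypotheses higher.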
Personalization for BERT-based Discriminative Speech Recognition Rescoring
Recognition of personalized content remains a challenge in end-to-end speech
recognition. We explore three novel approaches that use personalized content in
a neural rescoring step to improve recognition: gazetteers, prompting, and a
cross-attention based encoder-decoder model. We use internal de-identified
en-US data from interactions with a virtual voice assistant supplemented with
personalized named entities to compare these approaches. On a test set with
personalized named entities, we show that each of these approaches improves
word error rate by over 10% against a neural rescoring baseline. We also show
that on this test set, natural language prompts can improve word error rate by
7% without any training and with a marginal loss in generalization. Overall,
gazetteers were found to perform the best with a 10% improvement in word error
rate (WER), while also improving WER on a general test set by 1%.